Executive summary

This report presents a basic initial exploration of a text corpus drawn from three sources: Twitter, news websites, and blogs. After data cleaning, the three corpora contain roughly 4 million lines of sentences and 60 million words. Sampling 20% from each corpus, we summarize the most common words, and phrases of up to four words, and present them below.

Introduction

This report covers the initial data cleaning and preliminary analysis of a text corpus sourced from Twitter, news sites, and blogs on the internet, retrieved from the Data Science Capstone project on Coursera as part of the final project for the specialization.

Cleaning the data set

First I tidy the text by removing non-ASCII characters, converting everything to lowercase, and removing stop words (common words such as pronouns, "the", etc.) to decrease the computational burden in the machine-learning step. Then I remove text emoticons and punctuation, except apostrophes.
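These steps can be sketched in a few lines. The report's actual pipeline is written in R; the Python function below is only an illustration, and the stop-word set and emoticon pattern are small illustrative assumptions, not the real lists used.

```python
import re

# Illustrative subset of a stop-word list (the real analysis uses a
# full stop-word lexicon, e.g. the one shipped with an R text package).
STOP_WORDS = {"how", "are", "you", "the", "for", "be", "in", "to",
              "a", "i", "my", "your", "this", "is", "and", "no", "so"}

def clean_sentence(text):
    """Apply the cleaning steps described above to one raw sentence."""
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    text = text.lower()                              # lowercase everything
    text = re.sub(r"[:;][-']?[)(dp]", " ", text)     # strip common text emoticons
    text = re.sub(r"[^a-z0-9'\s]", " ", text)        # punctuation except apostrophe
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)
```

For example, `clean_sentence("Made my birthday even better :)")` yields `"made birthday even better"`, matching the shape of the cleaned samples shown below.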

Sample data before cleaning

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

Sample data after cleaning

##                                                          sentence
## 1:      btw thanks rt gonna dc anytime soon love see way way long
## 2:      meet someone special know heart beat rapidly smile reason
## 3:                                                    decided fun
## 4: tired played lazer tag ran lot ughh going sleep like 5 minutes
## 5:              words complete stranger made birthday even better
## 6: first cubs game ever wrigley field gorgeous perfect go cubs go

Basic summary

Preliminary analysis includes the top 10 most common phrases in each corpus and a summary of each corpus, including line counts and word counts.
Note that these summaries were computed after the data set was cleaned and may not reflect the original data. Since the cleaned set is the one that will be used for training, it is the set we analyze.

Summary of blog data set

## [1] "tidyblog data set has a total of 19751975 words and a total of 897371 lines"

Summary of twitter data set

## [1] "tidytwitter data set has a total of 17530938 words and a total of 2352326 lines"

Summary of news data set

## [1] "tidynews data set has a total of 20936984 words and a total of 1009854 lines"
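A summary string of the kind shown above is just a total word count and line count over the cleaned sentences. A minimal Python sketch (the report computes these in R; the helper name here is hypothetical):

```python
def summarize(name, sentences):
    """Build a summary string of total words and total lines for a corpus."""
    word_total = sum(len(s.split()) for s in sentences)
    return (f"{name} data set has a total of {word_total} words "
            f"and a total of {len(sentences)} lines")
```

For instance, `summarize("tidyblog", ["a b c", "d e"])` returns `"tidyblog data set has a total of 5 words and a total of 2 lines"`.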

Turning sentences into words and phrases

After some experimentation, I decided to use 20% of each corpus as the training data set; I tried to build the n-gram tables from the full data set, but my desktop could not handle it. I combined the samples from each corpus into a new data set, then split each sentence into words and short phrases (n-grams) to count the most frequently used ones.
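The sampling and n-gram counting can be sketched as follows. This is Python for illustration only (the report uses R); `sample_corpus` and `ngram_counts` are hypothetical names, though the 20% fraction matches the report.

```python
import random
from collections import Counter

def sample_corpus(lines, frac=0.2, seed=1):
    """Take a reproducible random sample of a corpus (20% by default)."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * frac))

def ngram_counts(sentences, n):
    """Count n-grams (space-joined strings) across cleaned sentences."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(sentences, 2).most_common(10)` produces tables of the same shape as those below.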

Combined data set

10 most common words in the blog sample data

##      ngram     n
##  1:    one 25238
##  2:   just 20031
##  3:   like 19713
##  4:    can 19699
##  5:   time 18335
##  6:    get 14237
##  7:    now 12061
##  8:   know 11976
##  9: people 11963
## 10:    new 11107

10 most common 2-grams in the blog sample data

##           ngram    n
##  1:   right now 1036
##  2:   years ago  995
##  3: even though  990
##  4:    new york  979
##  5:    year old  908
##  6:  first time  886
##  7:         1 2  875
##  8:   feel like  874
##  9:         u s  857
## 10:     can see  843

10 most common 3-grams in the blog sample data

##                   ngram   n
##  1:       new york city 169
##  2:             1 2 cup 150
##  3:      new york times 135
##  4:    couple weeks ago  97
##  5: amazon services llc  96
##  6:       llc amazon eu  96
##  7: services llc amazon  96
##  8:             1 4 cup  92
##  9:               1 1 2  89
## 10:          new york n  70

10 most common 4-grams in the blog sample data

##                                    ngram  n
##  1:           amazon services llc amazon 96
##  2:               services llc amazon eu 96
##  3:                         new york n y 68
##  4:         backgroun none repeat scroll 55
##  5:                 none repeat scroll 0 55
##  6:                    repeat scroll 0 0 55
##  7:                     scroll 0 0 yello 55
##  8:          style backgroun none repeat 55
##  9:                      0 0 yello class 53
## 10: advertising fees advertising linking 48

Plan for the machine learning algorithm and Shiny application

For the machine learning algorithm, I plan to use a Markov chain model from the markovchain package, with a 1- to 4-gram model for text prediction using stupid backoff. I will then save the model for later use and load it into a Shiny app for prediction, after evaluating it on a validation data set.
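Stupid backoff scores a candidate word against the longest matching context and multiplies by a fixed penalty (commonly 0.4) each time it backs off to a shorter context. The Python sketch below illustrates the idea under those assumptions; the actual implementation will be in R with the markovchain package, and all names here are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # the commonly used stupid-backoff penalty

def backoff_score(counts, context, word):
    """Score `word` given `context` (a list of preceding words).

    `counts` maps n -> Counter of space-joined n-grams, e.g. built
    from 1- to 4-gram tables like those in this report.
    """
    for k in range(len(context), 0, -1):          # longest context first
        ctx = " ".join(context[-k:])
        num = counts.get(k + 1, Counter())[ctx + " " + word]
        den = counts.get(k, Counter())[ctx]
        if num > 0 and den > 0:
            return (ALPHA ** (len(context) - k)) * num / den
    total = sum(counts[1].values())               # unigram fallback
    return (ALPHA ** len(context)) * counts[1][word] / total

def predict_next(counts, context):
    """Return the vocabulary word with the highest backoff score."""
    scored = {w: backoff_score(counts, context[-3:], w) for w in counts[1]}
    return max(scored, key=scored.get)
```

For example, with bigram counts containing "go cubs" twice, `predict_next(counts, ["go"])` returns `"cubs"`.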